Language-Independent Text Parsing of Arbitrary HTML-Documents. Towards A Foundation For Web Genre Identification

Author

  • Georg Rehm
Abstract

This article describes an approach to parsing and processing arbitrary web pages in order to detect macrostructural objects such as headlines, explicitly and implicitly marked lists, and text blocks of different types. The text parser analyses a document in several processing stages and inserts the analysis results directly into the DOM tree in the form of XML elements and attributes, so that both the original HTML structure and the determined macrostructure are available at the same time for secondary processing steps. This text parser is being developed for a novel kind of search engine that aims to classify web pages into web genres, so that the search engine user will be able to specify one or more keywords as well as one or more web genres of the documents to be found.
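The idea of writing analysis results back into the tree can be illustrated with a minimal sketch. It is not the parser described in the article: it assumes well-formed input (real web pages would first need a tolerant HTML parser), and the `macro` attribute, the `macro-list` element, and the bullet heuristic are hypothetical names and rules chosen only for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical heuristic: a paragraph starting with "-", "*", a bullet
# character, or "1." / "1)" is treated as an implicitly marked list item.
BULLET = re.compile(r"^\s*(?:[-*\u2022]|\d+[.)])\s+")
HEADLINES = {"h1", "h2", "h3", "h4", "h5", "h6"}
EXPLICIT_LISTS = {"ul", "ol", "dl"}


def is_bullet_paragraph(el):
    return el.tag == "p" and bool(BULLET.match(el.text or ""))


def annotate(parent):
    """Annotate the subtree below `parent` in place.

    Headlines and explicit lists receive a `macro` attribute (assumed name).
    Runs of two or more bullet-like paragraphs are wrapped in a new
    `macro-list` element (assumed name), so the original elements stay
    in the tree next to the added macrostructure.
    """
    children = list(parent)
    i = 0
    while i < len(children):
        child = children[i]
        if child.tag in HEADLINES:
            child.set("macro", "headline")
        elif child.tag in EXPLICIT_LISTS:
            child.set("macro", "explicit-list")
        elif is_bullet_paragraph(child):
            j = i
            while j < len(children) and is_bullet_paragraph(children[j]):
                j += 1
            if j - i >= 2:
                wrapper = ET.Element("macro-list", {"kind": "implicit"})
                pos = list(parent).index(child)
                for item in children[i:j]:
                    parent.remove(item)
                    wrapper.append(item)
                parent.insert(pos, wrapper)
                i = j
                continue
        annotate(child)
        i += 1


page = ET.fromstring(
    "<body><h1>News</h1><p>* first</p><p>* second</p><p>Some text.</p></body>"
)
annotate(page)
print(ET.tostring(page, encoding="unicode"))
```

Because the annotations are added to the existing tree rather than stored separately, a later processing step can query the original tags and the detected macrostructure with the same tree API.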

Similar resources

Towards Automatic Web Genre Identification: A Corpus-Based Approach in the Domain of Academia by Example of the Academic's Personal Homepage

We argue for a systematic analysis of one particular, well-structured domain, academic Web pages, with regard to a special class of digital genres: Web genres. For this purpose, we have developed a database-driven system that will ultimately consist of more than 3 000 000 HTML documents, written in German, which are the empirical basis for our research. We introduce the notions of Web genre type ...

Concurrent programming on the web with Webstream

We describe Webstream, a language to simplify the development of client-side web applications, particularly web-aware information agents. Webstream encapsulates web documents as streams of messages passing between concurrent lightweight threads, permitting operations to be carried out lazy-evaluation style while documents are in the process of being retrieved. Streams can be pipelined through f...

ThesWB: A Tool for Thesaurus Construction from HTML Documents

Electronically available documents on the Web are exploding at an ever-increasing rate. Many Web documents, however, contain rich knowledge that describes the document's content. The Web can be viewed as a body of text containing two fundamentally different types of data: the contents and the tags. A tag is in HTML (Hyper-Text Markup Language) meta-data describing the layout and linking structu...

Exploring Impacts of Consciousness-raising in a Genre-based Pedagogy

This study reports on the findings of a genre teaching course for developing academic writing of a class of EFL students in Iran. The information report genre was taught in a cyclical way of teaching and learning, which was started from ‘setting the context’ and ‘deconstruction’ of prototype information report genre, and continued with ‘joint construction’, ‘independent construction’, and final...

Complementary Approaches to Representing Differences Between Structured Documents

Structured documents: Documents can be represented as structures with a hierarchical arrangement of text and non-text nodes, where nodes are labelled by category names such as “paragraph” and “section”. Representing documents this way is a natural consequence of using the Standard Generalized Markup Language (SGML) to encode the content and form of documents [10, 11, 7]. SGML is widely used. HTM...

Journal title:
  • LDV Forum

Volume 20  Issue 

Pages  -

Publication date 2005